🏆 LLM Benchmarking - emschwartz

Discussed on Hacker News

📋Text Quality GitHub·

MatteoLeonesi/claim-memory-graph-sdk: A memory layer that tracks evidence, claims, and decisions to make multi-turn LLM judges and reviewer agents more inspectable and stable.

Discussed on Hacker News

🤖AI lmsys.org·

DFlash and Spec V2 Decoding (14 minute read)

Covers 5 stories including Looking for a self-hosted alternative to Modal.com for running ML workloads

Discussed on Hacker News

Less-relevant results

🕳LLM Vulnerabilities lesswrong.com·

Your Model Organisms Might Be Fried

Covers Teaching Claude why

🤖AI Machine Learning Blog·

Pre-Training Isn’t Bitter Enough

Covered by Deep Learning Weekly

🧠LLM Inference arxiv.org·

GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs

🤖AI huggingface.co·

Qwen and Fable: An open-weights agentic coding model. 35B Mixture-of-Experts

Covered by news.smol.ai

Discussed on Hacker News

🤖AI lesswrong.com·

Do k-Sparse Autoencoders Reveal Thinking Patterns? Interpretable Features in a Small Reasoning Model

🛡️AI Safety arxiv.org·

Where Does Social Reasoning Come From? Capability Provenance in Language Models

🔧Developer tools infoworld.com·

Shipping enterprise-quality code with AI agents

Covers 2 stories including Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

📜Brooklyn History arxiv.org·

Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs

🇨🇳Chinese AI GitHub·

GoLongRL: Capability-Oriented Long Context RL with Multitask Alignment

Discussed on Hacker News

🆕New AI arxiv.org·

Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation

🤖AI lesswrong.com·

A Geometric Account of Activation Steering through Angle–Norm Decomposition

🔤Tokenization arxiv.org·

Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation

💾Prompt Caching arxiv.org·

Towards Distributed Inference of LLMs on a P2P Network

📊Statistical Ranking arxiv.org·

Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations

🦉Qwen arxiv.org·

Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning

🔤Tokenization arxiv.org·

Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models

Train LLM from Scratch

Surpassing Scale by Efficiency: A Compact 135M Parameter Foundational LLM Natively Adapted for the Bangla Language